Data analysis and visualization in R

Module 5 – Visual narrative, EDA, colors, and accessibility

Author

Albert Tafur Rangel, M.Sc, Ph.D.

Design an epidemiological data analysis and visualization project

This document integrates the conceptual, methodological, and technical competencies developed throughout the previous modules of the course. Its objective is to guide the participant through a complete and reproducible workflow for epidemiological data analysis using the R ecosystem. The material covers the sequential stages of a rigorous analytic pipeline, including data importation, inspection of data structures, cleaning and harmonization, exploratory analysis, generation of summary statistics, identification of anomalies, and the construction of visualizations that support analytical reasoning and evidence-based interpretation. Emphasis is placed on the design of effective figures, the evaluation of alternative graphical encodings, and the application of accessibility principles such as perceptually uniform color palettes and color-blind-safe schemes.

The dataset employed consists of records of dengue cases obtained from real epidemiological surveillance sources and complemented with simulated data to illustrate methodological challenges frequently encountered in practice, such as inconsistent formats, missing values, atypical observations, and heterogeneous spatial or temporal resolution. The database includes climatic, demographic, and geographic variables commonly used in epidemiological analyses, enabling the participant to explore associations between disease incidence and environmental factors, assess temporal patterns, and develop visual narratives that communicate findings with clarity, rigor, and interpretability. Through these exercises, participants will consolidate their proficiency in R and strengthen their ability to construct reproducible analyses aligned with best practices in public health data science.

Load the libraries

The analytical workflow begins with the configuration of a coherent and reproducible working environment in R. This step ensures that all required packages are available, loaded, and functioning correctly before any data manipulation or visualization is performed. Establishing a well-defined environment is essential for ensuring consistency across analyses, facilitating collaboration, and preventing errors arising from version incompatibilities or missing dependencies.

In this module, we rely on a set of packages from the tidyverse ecosystem to manage data structures, perform transformations, and construct statistical graphics under a unified grammar. Additional libraries support tasks related to data cleaning, date handling, accessibility evaluation, and the implementation of perceptually uniform color scales—elements fundamental for producing rigorous and reproducible epidemiological visualizations.

Before loading the packages, it is important to verify that they are installed in your local R environment. If a package is missing, it can be installed using the following general syntax:

install.packages("package_name")

For example, to install the tidyverse:

install.packages("tidyverse")
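As a rough sketch of the verification step, the snippet below lists which of the packages loaded in this module are missing and installs only those. The helper name `missing_packages` is hypothetical, and the `interactive()` guard is an assumption added here so that nothing is installed while a report is being rendered non-interactively.

```r
# Packages loaded in this module
required_pkgs <- c("tidyverse", "janitor", "lubridate", "readxl",
                   "colorspace", "colorblindcheck", "scales",
                   "patchwork", "knitr")

# Hypothetical helper: return the packages that cannot be found locally
missing_packages <- function(pkgs) {
  found <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!found]
}

to_install <- missing_packages(required_pkgs)

# Install only what is missing, and only in an interactive session
if (interactive() && length(to_install) > 0) {
  install.packages(to_install)
}
```

If `to_install` is empty, every package is already available and the `library()` calls should succeed.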

Once all required packages are installed, they can be loaded as follows:

# Core data science ecosystem
library(tidyverse)      # Data manipulation, summarization, ggplot2
library(janitor)        # Data cleaning, variable standardization
library(lubridate)      # Date and time parsing
library(readxl)         # Importing Excel datasets

# Visualization and color management
library(colorspace)     # Perceptually uniform palettes
library(colorblindcheck) # Accessibility diagnostics for color palettes
library(scales)         # Formatting axes, labels, and scales
library(patchwork)      # Combining several plots into one layout

# Additional utilities
library(knitr)          # Reporting utilities in Quarto/R Markdown

Once the environment has been initialized, participants should confirm that their working directory is correctly set and that the local structure of files is organized to support reproducibility. This includes ensuring that datasets, scripts, and outputs (figures, tables, and derived data) are stored in appropriately labeled folders. A disciplined setup at this stage provides a solid foundation for the subsequent stages of epidemiological data analysis.

A well-organized directory structure is fundamental for ensuring reproducibility, transparency, and efficient collaboration in data-driven epidemiological analyses. As a recommended practice, each project should be contained within a dedicated folder named according to the study or initiative—for example, ASIS. Within this main directory, it is advisable to create a set of subfolders that separate code, data, and analytical outputs. A common and effective structure includes:

  • code/ — scripts used for importing, cleaning, transforming, and analyzing the data.

  • data/

    • raw/ — original datasets stored exactly as received, without modifications.

    • processed/ — cleaned, harmonized, or transformed datasets generated during the analysis.

  • results/

    • figures/ — visualizations produced during the exploratory and inferential steps.

    • tables/ — summary statistics, model outputs, and tabulated results.

This hierarchical organization facilitates traceability, prevents accidental overwriting of original data, and supports a seamless workflow when producing reproducible reports with Quarto. It also enables clear version control and easier communication of analytic decisions during collaborative work or peer review.
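The layout described above can be created from R itself. The following is a minimal sketch using base R only; the helper name `create_project_dirs` is hypothetical, and the temporary location is used purely for demonstration.

```r
# Hypothetical helper: create the recommended folder layout under `root`
create_project_dirs <- function(root) {
  subdirs <- c("code",
               file.path("data", c("raw", "processed")),
               file.path("results", c("figures", "tables")))
  paths <- file.path(root, subdirs)
  # recursive = TRUE also creates parent folders; existing folders are left as they are
  for (p in paths) dir.create(p, recursive = TRUE, showWarnings = FALSE)
  invisible(paths)
}

# Demonstration in a temporary location; replace with your own project path
project_root <- file.path(tempdir(), "ASIS")
created <- create_project_dirs(project_root)

# List the folders that now exist under the project root
list.dirs(project_root, full.names = FALSE, recursive = TRUE)
```

Running the helper a second time is harmless, which makes it safe to keep at the top of a setup script.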

Color Definitions and Principles for Consistent Use

A coherent and well-documented color strategy is essential for producing visualizations that are accurate, accessible, and interpretable in epidemiological contexts. Color must not serve merely as decoration; it functions as a perceptual encoding that guides attention, emphasizes contrasts, and supports analytical reasoning. For this reason, we adopt a structured set of palettes—sequential, diverging, and qualitative—each selected according to the type and scale of the variable being represented. All palettes included are perceptually uniform or color-blind-safe, ensuring accessibility and consistency throughout the analytical workflow.

Sequential Palettes

Sequential palettes are appropriate for variables measured on a continuous scale and where magnitude carries interpretive meaning, such as incidence rates, temperature, rainfall, or risk scores. These palettes encode increasing intensity through a smooth progression of luminance.

Tip

Recommended options:

pal_seq_viridis <- colorspace::sequential_hcl(7, palette = "Viridis") 

pal_seq_blues <- colorspace::sequential_hcl(7, palette = "Blues") 

pal_seq_inferno <- colorspace::sequential_hcl(7, palette = "Inferno")

Use when:

  • Representing epidemiological counts or rates.

  • Mapping gradients in heatmaps, temporal trends, or geospatial incidence surfaces.

  • Emphasizing low-to-high transitions without categorical breaks.
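One way to check the claim that a sequential palette progresses smoothly in luminance is to convert its colors to HCL coordinates and inspect the L channel. The sketch below does this for the Viridis palette; the object names are illustrative.

```r
library(colorspace)

pal <- sequential_hcl(7, palette = "Viridis")

# Convert hex codes to polar LUV (HCL) coordinates and extract luminance (L)
lum <- coords(as(hex2RGB(pal), "polarLUV"))[, "L"]
round(lum, 1)

# A sequential palette should change luminance in one direction only
is_monotonic <- all(diff(lum) > 0) || all(diff(lum) < 0)
is_monotonic

# Visual checks: color swatches and hue/chroma/luminance trajectories
swatchplot(pal)
specplot(pal)
```

The same check can be run on any candidate palette before it is used in a heatmap or map.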

Diverging Palettes

Diverging palettes are intended for variables with a meaningful central reference point—for example, deviations from baseline, anomalies relative to average temperature, percent change, or differences before and after an intervention. These palettes emphasize both directionality and magnitude.

Tip

Recommended options:

pal_div_blue_red <- colorspace::diverging_hcl(9, palette = "Blue-Red 3") 

pal_div_green_brown <- colorspace::diverging_hcl(9, palette = "Green-Brown")

Use when:

  • Comparing increases vs. decreases in incidence.

  • Visualizing residuals, standardized differences, or temporal anomalies.

  • Communicating values on both sides of a reference threshold.
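As an illustration of anchoring a diverging palette at a reference value, the sketch below maps a simulated percent change to color with the neutral tone fixed at zero. The data frame and its column names are hypothetical.

```r
library(ggplot2)
library(colorspace)

set.seed(1)
# Hypothetical data: weekly percent change in incidence for four regions
cambio <- expand.grid(semana = 1:12, region = paste("Region", LETTERS[1:4]))
cambio$pct_change <- round(rnorm(nrow(cambio), mean = 0, sd = 20), 1)

p_div <- ggplot(cambio, aes(x = semana, y = region, fill = pct_change)) +
  geom_tile(color = "white") +
  # mid = 0 pins the neutral color of the palette to "no change"
  scale_fill_continuous_diverging(palette = "Blue-Red 3", mid = 0) +
  labs(x = "Epidemiological week", y = NULL, fill = "% change") +
  theme_minimal()

p_div
```

If the reference point is not zero (for example, a baseline incidence), `mid` can be set to that value instead.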

Qualitative Palettes

Qualitative palettes are suitable for categorical variables with no intrinsic order, such as regions, municipalities, vector species, or diagnostic categories. Colors must be distinguishable and carry equal perceptual weight.

Tip

Recommended options:

pal_qual_dark3 <- colorspace::qualitative_hcl(8, palette = "Dark 3") 

pal_qual_set2 <- colorspace::qualitative_hcl(8, palette = "Set 2") 

pal_qual_harmonic <- colorspace::qualitative_hcl(8, palette = "Harmonic")

Use when:

  • Visualizing multiple administrative units.

  • Differentiating categories with similar epidemiological importance.

  • Avoiding perceptual hierarchies where no order is intended.
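When the same categories appear in several figures, naming the palette vector keeps each category tied to one color. The sketch below assumes four hypothetical region labels.

```r
# Hypothetical category labels; replace with the levels in your own data
regiones <- c("Caribe", "Andina", "Pacifico", "Orinoquia")

# Qualitative palette trimmed to the number of categories
pal <- colorspace::qualitative_hcl(length(regiones), palette = "Dark 3")

# Naming the vector fixes the category-to-color mapping, so a region keeps
# its color even when a plot shows only a subset of regions
pal_regiones <- setNames(pal, regiones)

pal_regiones[c("Andina", "Caribe")]
```

A named vector like this can be passed directly to `scale_fill_manual(values = pal_regiones)` or `scale_color_manual(values = pal_regiones)`.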

Special Palette for Sex-Stratified Comparisons

Epidemiological analyses frequently require comparison between men and women. The use of culturally ambiguous or stereotypical colors is discouraged; instead, we apply a palette that is perceptually balanced, color-blind-safe, and maintains clear contrast between groups.

Tip

Recommended options:

pal_sex <- c( 
    "Mujeres" = "#A9A9A9", # Dark gray
    "Hombres" = "#708090"  # Slate gray
 )

Justification:

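  • The two hex codes above are neutral grays (dark gray and slate gray), so the groups are told apart by lightness rather than hue. If a hue-based pair is preferred, the blue (#0072B2) and vermilion (#D55E00) of the Okabe & Ito palette, designed for color-blind accessibility, are a common choice.

  • Because the two grays differ only moderately in lightness, confirm that the pair remains readable in lines, points, and bars before relying on it.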

  • It avoids cultural pink/blue stereotypes while remaining intuitive in analytic presentations.

Color-Accessibility Diagnostics

Ensuring that visualizations are accessible to individuals with color-vision deficiencies (CVD) is a central requirement for scientific communication. Epidemiological analyses frequently inform decision-making among diverse audiences, including public health officials, clinicians, researchers, and community stakeholders. Consequently, all visual encodings must remain interpretable under common forms of color-blindness such as protanopia, deuteranopia, and tritanopia.

To support accessibility, we incorporate systematic diagnostic tools provided by the colorblindcheck package. This package simulates how a palette appears under different CVD conditions and reports whether its colors remain sufficiently far apart. These diagnostics should be applied before adopting any palette in recurrent analyses or final reporting.

The following example shows how to evaluate the palette defined for sex-stratified comparisons:

# Evaluate perceptual distinguishability of the sex palette
colorblindcheck::palette_check(pal_sex)
          name n tolerance ncp ndcp min_dist mean_dist max_dist
1       normal 2  16.74923   1    1 16.74923  16.74923 16.74923
2 deuteranopia 2  16.74923   1    1 17.19531  17.19531 17.19531
3   protanopia 2  16.74923   1    0 15.79820  15.79820 15.79820
4   tritanopia 2  16.74923   1    1 18.14135  18.14135 18.14135

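For normal vision and each simulated deficiency, this function reports the number of color pairs (ncp), the number of pairs that remain differentiable (ndcp), and the minimum, mean, and maximum distance between colors. A palette holds up under a given condition when ndcp equals ncp, that is, when every pair stays at least as far apart as the tolerance (by default, the smallest distance in the original palette). In the output above, protanopia shows ndcp = 0: the two colors of pal_sex move closer together (15.80 against a tolerance of 16.75), so the pair deserves closer inspection.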
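The output can also be inspected programmatically. The sketch below compares the palette defined in this document with an alternative pair; `pal_alt` is an assumption introduced here for comparison (Okabe-Ito blue and vermilion), not a palette used elsewhere in the module.

```r
library(colorblindcheck)

# Palette as defined in this document
pal_sex <- c("Mujeres" = "#A9A9A9", "Hombres" = "#708090")

# Assumed alternative for comparison: Okabe-Ito blue and vermilion
pal_alt <- c("Mujeres" = "#0072B2", "Hombres" = "#D55E00")

chk_sex <- palette_check(pal_sex)
chk_alt <- palette_check(pal_alt)

# Vision profiles in which at least one pair falls below the tolerance
chk_sex$name[chk_sex$ndcp < chk_sex$ncp]
chk_alt$name[chk_alt$ndcp < chk_alt$ncp]

# Smallest pairwise distance per profile, side by side
data.frame(profile  = chk_sex$name,
           pal_sex  = chk_sex$min_dist,
           pal_alt  = chk_alt$min_dist)
```

Larger minimum distances indicate colors that are easier to tell apart under the corresponding vision profile.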

Note

Check

Before finalizing a visualization, it is recommended to test the graph using simulated CVD transformations. The deutan(), protan(), and tritan() functions from colorspace transform a vector of colors to simulate deuteranopia, protanopia, and tritanopia, respectively. The transformed palettes can then be applied to copies of the plot and arranged in a panel next to the normal-vision version.

Example:

set.seed(123)

# Create a sequence of dates (12 weeks)
fechas <- seq.Date(from = as.Date("2023-01-01"),
                   by   = "week",
                   length.out = 12)

# Generate simulated case counts by sex
datos <- tibble(
  fecha = rep(fechas, times = 2),
  sexo  = rep(c("Mujeres", "Hombres"), each = length(fechas)),
  casos = c(
    # Women: gentle upward trend with fluctuation
    round(runif(12, min = 20, max = 60) + seq(0, 11)*1.5),
    # Men: slightly higher values with more variability
    round(runif(12, min = 30, max = 75) + seq(0, 11)*2)
  )
)

p <- datos |> 
  ggplot(aes(fecha, casos, color = sexo)) +
  geom_line(linewidth = 1.2) +
  scale_color_manual(values = pal_sex) +
  labs(
    title = "Incidencia de dengue",
    x = "Semana epidemiológica",
    y = "Número de casos") +
  theme_minimal(base_size = 13)

# Generate a 4-panel simulation of color-vision variations
pal_sex_deutan <- deutan(pal_sex)
pal_sex_protan <- protan(pal_sex)
pal_sex_tritan <- tritan(pal_sex)

# show results
list(
  original = pal_sex,
  deutan = pal_sex_deutan,
  protan = pal_sex_protan,
  tritan = pal_sex_tritan
)

p1 <- p

p2 <- p + 
  scale_color_manual(values = pal_sex_deutan) + 
  labs(title = "Deuteranopia")

p3 <- p + 
  scale_color_manual(values = pal_sex_protan) + 
  labs(title = "Protanopia")

p4 <- p + 
  scale_color_manual(values = pal_sex_tritan) + 
  labs(title = "Tritanopia")
# Show the comparison as a 2 x 2 panel (patchwork)
(p1 | p2) /
(p3 | p4)

The resulting panel helps determine whether line overlap or insufficient contrast could obscure epidemiological patterns for color-blind readers.

Sequential and diverging palettes also require diagnostic evaluation, especially when used in heatmaps or geospatial gradients where subtle hue variations carry meaningful information.

Important

Good Practices for Accessibly Designed Figures

To ensure consistent accessibility across all visualizations:

  • Prefer palettes derived from perceptually uniform color spaces (e.g., CIELAB, HCL).

  • Avoid relying solely on color to encode meaning; incorporate line types, shapes, or annotations when appropriate.

  • Ensure minimum contrast ratios between adjacent colors or classes.

  • Test all finalized figures with deutan(), protan(), and tritan() prior to publication or dissemination.

By incorporating these diagnostics directly into the analytical workflow, the integrity and inclusiveness of data communication in epidemiological studies are strengthened, promoting clearer interpretation and equitable access to information.
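One of the practices listed above, not relying on color alone, can be sketched as follows. The data are simulated for illustration, and the Okabe-Ito blue and vermilion used here are an assumed choice rather than the palette defined earlier.

```r
library(ggplot2)

set.seed(123)
# Hypothetical weekly counts for two groups
semanas <- rep(1:12, times = 2)
demo <- data.frame(
  semana = semanas,
  sexo   = rep(c("Mujeres", "Hombres"), each = 12),
  casos  = round(runif(24, min = 20, max = 70) + semanas)
)

p_redundant <- ggplot(demo, aes(semana, casos, color = sexo,
                                linetype = sexo, shape = sexo)) +
  geom_line(linewidth = 1) +
  geom_point(size = 2.5) +
  # Assumed colors: Okabe-Ito blue and vermilion
  scale_color_manual(values = c(Mujeres = "#0072B2", Hombres = "#D55E00")) +
  # Same legend title for all three aesthetics, so they share one legend
  labs(x = "Epidemiological week", y = "Cases",
       color = "Sex", linetype = "Sex", shape = "Sex") +
  theme_minimal(base_size = 13)

p_redundant
```

Because line type and point shape repeat the information carried by color, the groups stay distinguishable in grayscale prints and under simulated CVD.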

Read the data

First, we need to get the data.

  • We can download all the datasets for this module from https://github.com/ae-tafur/data_visualization/tree/main/05_projects/excercises/data. Then, pick any of the available datasets and load it. Assuming the file is saved in your Downloads folder, replace the path and file name in the example below with your own.

    Note

    data <- read.csv("~/Downloads/example_data.csv")

  • Alternatively, we can read the data directly from the repository by using a URL. We will use this option here.

url <- "https://raw.githubusercontent.com/ae-tafur/data_visualization/main"

terridata <- read_delim(
  file.path(url,"02_data_exploration/excercises/data/TerriData20570.txt"),
  delim = "|")
Rows: 11140 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: "|"
chr (9): Departamento, Entidad, Dimensión, Subcategoría, Indicador, Dato Num...
dbl (4): Código Departamento, Código Entidad, Año, Mes

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Now explore the data

terridata |> 
  slice(1:20) |> 
  kable(format = "html", table.attr = "class='table table-striped'")
| Código Departamento | Departamento | Código Entidad | Entidad | Dimensión | Subcategoría | Indicador | Dato Numérico | Dato Cualitativo | Año | Mes | Fuente | Unidad de Medida |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 20 | Cesar | 20570 | Pueblo Bello | Descripción general | Descripción general | Código DANE | NA | 20570 | 2000 | 0 | DANE | Texto |
| 20 | Cesar | 20570 | Pueblo Bello | Descripción general | Descripción general | Región | NA | Caribe | 2000 | 0 | DANE | Texto |
| 20 | Cesar | 20570 | Pueblo Bello | Descripción general | Descripción general | Subregión (SGR) | NA | Norte | 2000 | 0 | DNP | Texto |
| 20 | Cesar | 20570 | Pueblo Bello | Descripción general | Descripción general | Categoría ley 617 de 2000 | NA | 6 | 2000 | 0 | Ley 617 de 2000 | Texto |
| 20 | Cesar | 20570 | Pueblo Bello | Descripción general | Descripción general | Categoría ley 617 de 2000 | NA | 6 | 2018 | 12 | Ley 617 de 2000 | Texto |
| 20 | Cesar | 20570 | Pueblo Bello | Descripción general | Descripción general | Categoría ley 617 de 2000 | NA | 6 | 2019 | 12 | Ley 617 de 2000 | Texto |
| 20 | Cesar | 20570 | Pueblo Bello | Descripción general | Descripción general | Categoría ley 617 de 2000 | NA | 6 | 2020 | 12 | Ley 617 de 2000 | Texto |
| 20 | Cesar | 20570 | Pueblo Bello | Descripción general | Descripción general | Categoría ley 617 de 2000 | NA | 6 | 2021 | 12 | Ley 617 de 2000 | Texto |
| 20 | Cesar | 20570 | Pueblo Bello | Descripción general | Descripción general | Categoría ley 617 de 2000 | NA | 6 | 2022 | 12 | Ley 617 de 2000 | Texto |
| 20 | Cesar | 20570 | Pueblo Bello | Descripción general | Descripción general | Categoría ley 617 de 2000 | NA | 6 | 2023 | 12 | Ley 617 de 2000 | Texto |
| 20 | Cesar | 20570 | Pueblo Bello | Descripción general | Descripción general | Categoría ley 617 de 2000 | NA | 6 | 2024 | 12 | Ley 617 de 2000 | Texto |
| 20 | Cesar | 20570 | Pueblo Bello | Descripción general | Descripción general | Categoría de ruralidad | NA | Rural disperso | 2000 | 0 | DNP | Texto |
| 20 | Cesar | 20570 | Pueblo Bello | Descripción general | Descripción general | Extensión | 859,00 | NA | 2017 | 3 | IGAC | Kilómetros cuadrados |
| 20 | Cesar | 20570 | Pueblo Bello | Descripción general | Descripción general | Población total | 27.007,00 | NA | 2018 | 12 | DANE | Personas |
| 20 | Cesar | 20570 | Pueblo Bello | Descripción general | Descripción general | Población total | 28.298,00 | NA | 2019 | 12 | DANE | Personas |
| 20 | Cesar | 20570 | Pueblo Bello | Descripción general | Descripción general | Población total | 29.017,00 | NA | 2020 | 12 | DANE | Personas |
| 20 | Cesar | 20570 | Pueblo Bello | Descripción general | Descripción general | Población total | 29.706,00 | NA | 2021 | 12 | DANE | Personas |
| 20 | Cesar | 20570 | Pueblo Bello | Descripción general | Descripción general | Población total | 30.292,00 | NA | 2022 | 12 | DANE | Personas |
| 20 | Cesar | 20570 | Pueblo Bello | Descripción general | Descripción general | Población total | 30.844,00 | NA | 2023 | 12 | DANE | Personas |
| 20 | Cesar | 20570 | Pueblo Bello | Descripción general | Descripción general | Población total | 31.317,00 | NA | 2024 | 12 | DANE | Personas |

As you can see, Dato Numérico has a problem: it was imported as character although it holds numbers. This is caused by regional number formatting: in Colombia, . is used as the thousands separator and , as the decimal separator. So, let’s fix this.

terridata <- terridata |> 
  mutate(`Dato Numérico` = parse_number(`Dato Numérico`, 
                                        locale = locale(decimal_mark = ",",
                                                        grouping_mark = ".")))

terridata |> 
  slice(1:20) |> 
  kable(format = "html", table.attr = "class='table table-striped'")
| Código Departamento | Departamento | Código Entidad | Entidad | Dimensión | Subcategoría | Indicador | Dato Numérico | Dato Cualitativo | Año | Mes | Fuente | Unidad de Medida |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 20 | Cesar | 20570 | Pueblo Bello | Descripción general | Descripción general | Código DANE | NA | 20570 | 2000 | 0 | DANE | Texto |
| 20 | Cesar | 20570 | Pueblo Bello | Descripción general | Descripción general | Región | NA | Caribe | 2000 | 0 | DANE | Texto |
| 20 | Cesar | 20570 | Pueblo Bello | Descripción general | Descripción general | Subregión (SGR) | NA | Norte | 2000 | 0 | DNP | Texto |
| 20 | Cesar | 20570 | Pueblo Bello | Descripción general | Descripción general | Categoría ley 617 de 2000 | NA | 6 | 2000 | 0 | Ley 617 de 2000 | Texto |
| 20 | Cesar | 20570 | Pueblo Bello | Descripción general | Descripción general | Categoría ley 617 de 2000 | NA | 6 | 2018 | 12 | Ley 617 de 2000 | Texto |
| 20 | Cesar | 20570 | Pueblo Bello | Descripción general | Descripción general | Categoría ley 617 de 2000 | NA | 6 | 2019 | 12 | Ley 617 de 2000 | Texto |
| 20 | Cesar | 20570 | Pueblo Bello | Descripción general | Descripción general | Categoría ley 617 de 2000 | NA | 6 | 2020 | 12 | Ley 617 de 2000 | Texto |
| 20 | Cesar | 20570 | Pueblo Bello | Descripción general | Descripción general | Categoría ley 617 de 2000 | NA | 6 | 2021 | 12 | Ley 617 de 2000 | Texto |
| 20 | Cesar | 20570 | Pueblo Bello | Descripción general | Descripción general | Categoría ley 617 de 2000 | NA | 6 | 2022 | 12 | Ley 617 de 2000 | Texto |
| 20 | Cesar | 20570 | Pueblo Bello | Descripción general | Descripción general | Categoría ley 617 de 2000 | NA | 6 | 2023 | 12 | Ley 617 de 2000 | Texto |
| 20 | Cesar | 20570 | Pueblo Bello | Descripción general | Descripción general | Categoría ley 617 de 2000 | NA | 6 | 2024 | 12 | Ley 617 de 2000 | Texto |
| 20 | Cesar | 20570 | Pueblo Bello | Descripción general | Descripción general | Categoría de ruralidad | NA | Rural disperso | 2000 | 0 | DNP | Texto |
| 20 | Cesar | 20570 | Pueblo Bello | Descripción general | Descripción general | Extensión | 859 | NA | 2017 | 3 | IGAC | Kilómetros cuadrados |
| 20 | Cesar | 20570 | Pueblo Bello | Descripción general | Descripción general | Población total | 27007 | NA | 2018 | 12 | DANE | Personas |
| 20 | Cesar | 20570 | Pueblo Bello | Descripción general | Descripción general | Población total | 28298 | NA | 2019 | 12 | DANE | Personas |
| 20 | Cesar | 20570 | Pueblo Bello | Descripción general | Descripción general | Población total | 29017 | NA | 2020 | 12 | DANE | Personas |
| 20 | Cesar | 20570 | Pueblo Bello | Descripción general | Descripción general | Población total | 29706 | NA | 2021 | 12 | DANE | Personas |
| 20 | Cesar | 20570 | Pueblo Bello | Descripción general | Descripción general | Población total | 30292 | NA | 2022 | 12 | DANE | Personas |
| 20 | Cesar | 20570 | Pueblo Bello | Descripción general | Descripción general | Población total | 30844 | NA | 2023 | 12 | DANE | Personas |
| 20 | Cesar | 20570 | Pueblo Bello | Descripción general | Descripción general | Población total | 31317 | NA | 2024 | 12 | DANE | Personas |

Perfect, now we can work with this data.
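To see what the locale arguments do in isolation, here is a small sketch on a character vector. The first two values are taken from the table above; the third is a made-up value added to show several grouping marks.

```r
library(readr)

# Values written with "." for thousands and "," for decimals
raw <- c("27.007,00", "859,00", "1.234.567,89")

# Locale describing that convention
loc_co <- locale(decimal_mark = ",", grouping_mark = ".")

parsed <- parse_number(raw, locale = loc_co)
parsed

# The result is a plain numeric vector, ready for arithmetic
is.numeric(parsed)
```

Without the `locale` argument, readr falls back to its default convention ("." as decimal mark and "," as grouping mark), which would misread these strings.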

Basic statistics

Check statistics

summary(terridata)
 Código Departamento Departamento       Código Entidad    Entidad         
 Min.   :20          Length:11140       Min.   :20570   Length:11140      
 1st Qu.:20          Class :character   1st Qu.:20570   Class :character  
 Median :20          Mode  :character   Median :20570   Mode  :character  
 Mean   :20                             Mean   :20570                     
 3rd Qu.:20                             3rd Qu.:20570                     
 Max.   :20                             Max.   :20570                     
                                                                          
  Dimensión         Subcategoría        Indicador         Dato Numérico       
 Length:11140       Length:11140       Length:11140       Min.   :-1.039e+04  
 Class :character   Class :character   Class :character   1st Qu.: 3.000e+00  
 Mode  :character   Mode  :character   Mode  :character   Median : 4.500e+01  
                                                          Mean   : 1.536e+08  
                                                          3rd Qu.: 9.090e+02  
                                                          Max.   : 7.435e+10  
                                                          NA's   :1906        
 Dato Cualitativo        Año            Mes           Fuente         
 Length:11140       Min.   :1985   Min.   : 0.00   Length:11140      
 Class :character   1st Qu.:2013   1st Qu.:12.00   Class :character  
 Mode  :character   Median :2018   Median :12.00   Mode  :character  
                    Mean   :2017   Mean   :11.81                     
                    3rd Qu.:2021   3rd Qu.:12.00                     
                    Max.   :2042   Max.   :12.00                     
                                                                     
 Unidad de Medida  
 Length:11140      
 Class :character  
 Mode  :character  
                   
                   
                   
                   

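However, these results are not very informative here. The table is in long format, so Dato Numérico mixes many indicators measured in different units, and its overall mean or quartiles have no meaningful interpretation.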
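A more useful summary is computed within each indicator. The sketch below uses a small hypothetical long-format table with ASCII column names; the population figures come from the table shown earlier, while the coverage values are invented for illustration.

```r
library(dplyr)

# Hypothetical long-format extract (column names are assumptions)
toy <- tibble(
  indicador     = rep(c("Poblacion total", "Cobertura de acueducto"), each = 3),
  unidad        = rep(c("Personas", "Porcentaje"), each = 3),
  anio          = rep(2018:2020, times = 2),
  dato_numerico = c(27007, 28298, 29017, 61.2, 63.5, 64.1)
)

# The pooled mean mixes people and percentages, so it cannot be interpreted
mean(toy$dato_numerico)

# Summaries computed within each indicator are interpretable
resumen <- toy |>
  group_by(indicador, unidad) |>
  summarise(n       = n(),
            minimo  = min(dato_numerico, na.rm = TRUE),
            mediana = median(dato_numerico, na.rm = TRUE),
            maximo  = max(dato_numerico, na.rm = TRUE),
            .groups = "drop")

resumen
```

With the real data, the same pattern would group by Indicador and Unidad de Medida before summarising Dato Numérico.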

Creating plots

One of the main uses of TerriData is to obtain demographic data. Let’s build a population pyramid.

terridata |> 
  # Keep three reference years
  filter(Año %in% c(2020, 2025, 2030)) |> 
  filter(str_starts(Indicador, "Porcentaje de población de")) |> 
  filter(str_starts(Subcategoría, "Población de")) |> 
  filter(str_detect(Fuente, "Censo 2018")) |> 
  # Negate the values for women so their bars extend to the left of zero
  mutate(`Dato Numérico` = ifelse(str_detect(Indicador, "mujeres"),
                                  -`Dato Numérico`, `Dato Numérico`),
         # The square brackets form a character class: every listed letter and
         # space is stripped, leaving only the remaining characters of the label
         Indicador = str_remove_all(
               Indicador,
               "[Porcentaje de población de mujeres de hombres de]"),
         Subcategoría = str_remove_all(Subcategoría, "Población de ")) |> 
  ggplot(aes(x = `Dato Numérico`, 
             y = Indicador,
             fill = str_to_title(Subcategoría),
             color = as.character(Año),
             group = as.character(Año))) +
  geom_col(position = "identity") +
  scale_fill_manual(values = pal_sex) +
  scale_color_manual(values = c("#000000", "#9D02D7", "#F5275E")) +
  scale_x_continuous(breaks = seq(-7,7,2), labels = abs) +
  theme_minimal() +
  theme(plot.caption = element_text(hjust = 0)) +
  labs(fill = "Grupo",
       color = "Año",
       x = "Porcentaje de la población total",
       y = "Quinquenios de edad")

Another plot can show access to energy, water, and sanitation.

terridata |> 
  filter(Indicador == "Cobertura de acueducto urbana (REC)" | 
           Indicador == "Cobertura de acueducto rural (REC)" |
           Indicador == "Cobertura de alcantarillado urbana (REC)" |
           Indicador == "Cobertura de alcantarillado rural (REC)" |
           Indicador == "Cobertura de Energía Eléctrica Urbana (Censo)" |
           Indicador == "Cobertura de Energía Eléctrica Rural (Censo)") |>  
  mutate(Grupo = ifelse(str_detect(str_to_lower(Indicador), "urbana"),
                        "Urbana", "Rural"),
         Indicador = ifelse(str_detect(Indicador, "Energía"), "Energía",
           ifelse(str_detect(Indicador, "acueducto"), 
                  "Acueducto", "Alcantarillado"))) |> 
  ggplot(aes(y = `Dato Numérico`, 
             x = as.character(Año),
             fill = Grupo)) +
  geom_col(position = "dodge") +
  facet_wrap(~Indicador, ncol = 1, scales = "free_x") +
  scale_fill_manual(values = pal_qual_set2) +
  theme_minimal() +
  theme(legend.position = "bottom",
        plot.caption = element_text(hjust = 0)) +
  labs(fill = "",
       x = "Año",
       y = "Cobertura (%)")

# Prepare breaks and labels safely
x_breaks <- terridata |> 
  filter(Indicador %in% c(
    "Cobertura de acueducto urbana (REC)", 
    "Cobertura de acueducto rural (REC)",
    "Cobertura de alcantarillado urbana (REC)",
    "Cobertura de alcantarillado rural (REC)",
    "Cobertura de Energía Eléctrica Urbana (Censo)",
    "Cobertura de Energía Eléctrica Rural (Censo)")) |> 
  filter(Año >= 2018) |> 
  distinct(Año) |> 
  arrange(Año) |> 
  mutate(Año_num = as.numeric(as.factor(Año)))

breaks_vec  <- x_breaks$Año_num
labels_vec <- x_breaks$Año

terridata |> 
  filter(Indicador %in% c(
    "Cobertura de acueducto urbana (REC)", 
    "Cobertura de acueducto rural (REC)",
    "Cobertura de alcantarillado urbana (REC)",
    "Cobertura de alcantarillado rural (REC)",
    "Cobertura de Energía Eléctrica Urbana (Censo)",
    "Cobertura de Energía Eléctrica Rural (Censo)")) |>  
  filter(Año >= 2018) |> 
  mutate(Grupo = ifelse(str_detect(str_to_lower(Indicador), "urbana"),
                   "Urbana", "Rural"),
    Indicador = case_when(str_detect(Indicador, "Energía") ~ "Energía",
                          str_detect(Indicador, "acueducto") ~ "Acueducto",
                          TRUE ~ "Alcantarillado"),
    Año_num = as.numeric(as.factor(Año)),
    offset = ifelse(Grupo == "Urbana", -0.15, 0.15),
    x_pos = Año_num + offset) |> 
  ggplot(aes(x = x_pos,
             y = `Dato Numérico`,
             color = Grupo)) +
  geom_segment(aes(x = x_pos,
                   xend = x_pos,
                   y = 0,
                   yend = `Dato Numérico`),
               linewidth = 1.1,
               alpha = 0.8) +
  geom_point(size = 5) +
  geom_text(aes(label = round(`Dato Numérico`, 1)),
            vjust = 0.5,
            hjust = 1.4,
            size = 3.5,
            fontface = "bold",
            show.legend = FALSE) +
  facet_wrap(~Indicador, ncol = 1, scales = "free_x") +
  scale_color_manual(values = pal_qual_set2) +
  scale_x_continuous(breaks = breaks_vec,
                     labels = labels_vec) + 
  theme_minimal(base_size = 13) +
  theme(legend.position = "bottom",
        plot.caption = element_text(hjust = 0),
        strip.text = element_text(face = "bold"),
        axis.title.y = element_blank(),
        axis.text.y  = element_blank(),
        axis.ticks.y = element_blank(),
        panel.grid.major.y = element_blank(),
        panel.grid.minor.y = element_blank()) +
  labs(color = "",
       x = "Año",
       title = "Cobertura por Servicios y Zona (%)")

terridata |> 
  filter(str_detect(Subcategoría, "Acceso a la educación")) |> 
  filter(!str_detect(Indicador, "superior")) |> 
  filter(!str_ends(Indicador, "Total")) |> 
  filter(Año >= 2016) |> 
  mutate(Grupo = ifelse(str_detect(Indicador, "bruta"), "Bruta", "Neta"),
         # Keep the text from character 19 onward, then trim stray whitespace
         Indicador = str_trim(str_sub(Indicador, 19))) |> 
  ggplot(aes(y = `Dato Numérico`, 
             x = Año,
             group = Indicador,
             color = Indicador)) +
  geom_point(size = 3, shape = 1) +
  geom_point(size = 1.5) +
  # linewidth replaces the size aesthetic for lines (deprecated in ggplot2 3.4.0)
  geom_line(linewidth = 0.5) +
  facet_wrap(~Grupo, scales = "free_x", ncol = 1) +
  scale_color_manual(values = pal_qual_dark3,
                     labels = stringr::str_to_sentence) +
  theme_minimal() +
  theme(plot.caption = element_text(hjust = 0)) +
  labs(color = "",
       x = "",
       y = "Cobertura")

terridata |> 
  filter(Año %in% c(2015, 2022)) |> 
  filter(str_detect(Subcategoría, "Acceso a la educación")) |> 
  filter(!str_detect(Indicador, "superior")) |> 
  mutate(Grupo = ifelse(str_detect(Indicador, "bruta"), "Bruta", "Neta"),
         Indicador = str_trim(str_sub(Indicador, 19))) |> 
  ggplot(aes(x = Año,
             y = `Dato Numérico`,
             group = Indicador,
             color = Indicador)) +
  geom_line(linewidth = 1) +
  geom_point(size = 3) +
  facet_wrap(~Grupo) +
  scale_color_manual(values = pal_qual_dark3,
                     labels = stringr::str_to_sentence) +
  theme_minimal() +
  labs(x = "", 
       y = "Cobertura", 
       color = "")

terridata |> 
  filter(str_detect(Subcategoría, "Acceso a la educación")) |> 
  filter(!str_detect(Indicador, "superior")) |> 
  filter(!str_ends(Indicador, "Total")) |> 
  filter(Año >= 2016) |> 
  mutate(Grupo = ifelse(str_detect(Indicador, "bruta"), "Bruta", "Neta"),
         Indicador = str_trim(str_sub(Indicador, 19))) |> 
  ggplot(aes(x = Año,
             y = `Dato Numérico`,
             fill = Indicador)) +
  geom_area(position = "fill", alpha = 0.85) +
  facet_wrap(~Grupo) +
  scale_fill_manual(values = pal_qual_dark3,
                    labels = stringr::str_to_sentence) +
  theme_minimal() +
  labs(x = "", 
       y = "Proporción", 
       fill = "")

Dengue simulated data

Let’s create a simulated dataset.

set.seed(123)

# General parameters
anios <- 2019:2023
semanas <- 1:52

# Time grid (one row per year and week)
datos <- expand.grid(Año = anios,
                     Semana = semanas)

# Seasonal component (typical dengue pattern)
datos <- datos |>
  mutate(estacional = 20 + 15 * sin(2 * pi * Semana / 52),
         ruido = rpois(n(), lambda = 5),
         casos_base = round(estacional + ruido))

# Create an epidemic outbreak in 2023 (weeks 20–32)
datos <- datos |>
  mutate(brote = ifelse(Año == 2023 & Semana %in% 20:32,
                        rpois(n(), lambda = 40),
                        0),
         casos = casos_base + brote) |>
  select(Año, Semana, casos)

Now, plot an endemic channel.

# Endemic channel built from the historical years
canal_endemico <- datos |>
  filter(Año < 2023) |>
  group_by(Semana) |>
  summarise(p10 = quantile(casos, 0.10),
            p25 = quantile(casos, 0.25),
            p50 = quantile(casos, 0.50),
            p75 = quantile(casos, 0.75),
            p90 = quantile(casos, 0.90),
            .groups = "drop")

# Data for the evaluation year
casos_2023 <- datos |>
  filter(Año == 2023)

canal_endemico |> 
  ggplot(aes(x = Semana)) +
  # Layers go from the widest band to the narrowest, so narrower bands are painted on top
  # Outbreak zone (up to p90)
  geom_area(aes(y = p90, fill = "Brote"), alpha = 0.8) +
  # Alert zone (up to p75)
  geom_area(aes(y = p75, fill = "Alerta"), alpha = 0.8) +
  # Security zone (up to p25)
  geom_area(aes(y = p25, fill = "Seguridad"), alpha = 0.8) +
  # Success zone (up to p10)
  geom_area(aes(y = p10, fill = "Éxito"), alpha = 0.8) +
  # Central percentile (median)
  geom_line(aes(y = p50), color = "black", linewidth = 1) +
  # Observed cases (2023)
  geom_point(data = casos_2023,
             aes(x = Semana, y = casos),
             color = "#D55E00",
             size = 2.2) +
  scale_fill_manual(values = c("Éxito" = "#E5F5E0",
                               "Seguridad" = "#A1D99B",
                               "Alerta" = "#FCBBA1",
                               "Brote" = "#CB181D")) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "bottom",
        panel.grid.minor = element_blank()) +
  labs(
    title = "Canal endémico de dengue (casos semanales)",
    subtitle = "Construido con percentiles históricos (2019–2022)",
    x = "Semana epidemiológica",
    y = "Número de casos",
    fill = "Zona epidemiológica",
    caption = "Puntos: casos observados en 2023 (datos simulados)")
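Beyond the visual check, the weeks that exceed the upper limit of the channel can be listed directly. The sketch below re-creates a simulation like the one above with ASCII column names (`anio`, `semana`), which are assumptions made for portability, and flags the 2023 weeks above the historical 90th percentile.

```r
library(dplyr)

set.seed(123)
# Simulated weekly cases with an outbreak in 2023 (weeks 20 to 32)
sim <- expand.grid(anio = 2019:2023, semana = 1:52) |>
  mutate(base  = round(20 + 15 * sin(2 * pi * semana / 52) + rpois(n(), lambda = 5)),
         brote = ifelse(anio == 2023 & semana %in% 20:32,
                        rpois(n(), lambda = 40), 0),
         casos = base + brote)

# Upper limit of the channel: 90th percentile of the historical years
umbral <- sim |>
  filter(anio < 2023) |>
  group_by(semana) |>
  summarise(p90 = quantile(casos, 0.90), .groups = "drop")

# Weeks of the evaluation year whose counts exceed the historical p90
semanas_alerta <- sim |>
  filter(anio == 2023) |>
  left_join(umbral, by = "semana") |>
  filter(casos > p90) |>
  pull(semana)

semanas_alerta
```

The simulated outbreak weeks should appear in the result. A few other weeks may be flagged as well, since a 90th percentile estimated from only four historical years is a noisy threshold.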